Hg19 and hg38 analysis from same workflow branch #8

NagaComBio · 2021-03-16T10:34:15Z

Merging hg38 developments into the main (master) branch.

Main updates

Hg38-related files are in a separate analysis XML.
Hg19 changes:
- EVS and ExAC are removed from the hg19 annotation and from no-control filtering. This should not change the high-confidence somatic variants, but it does change the number of columns in the output file.
- The default local control AF threshold changes from 0.01 to 0.05; this might increase the number of final high-confidence variants.
- The 'SNP_germline_support' annotation now uses user-defined AF thresholds instead of hard-coded ones; this might change the final counts.
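
To illustrate why raising the local control AF threshold can increase the number of final variants, here is a minimal sketch. It assumes, hypothetically, that a variant is dropped when its local control allele frequency exceeds the threshold; `passes_local_control` and the example frequencies are made up for illustration and are not taken from the workflow.

```python
def passes_local_control(af, max_af):
    """Return True if the local-control allele frequency is at or below max_af.

    Hypothetical helper: the real annotation logic in the workflow may differ.
    """
    return af <= max_af


# Illustrative local-control AFs for four candidate variants (made-up values).
local_control_afs = [0.001, 0.02, 0.04, 0.2]

kept_old = [af for af in local_control_afs if passes_local_control(af, 0.01)]
kept_new = [af for af in local_control_afs if passes_local_control(af, 0.05)]

# With the looser threshold, the variants at 0.02 and 0.04 are kept as well.
print(len(kept_old), len(kept_new))
```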

Integrated testing

Performed different tests and the results are in Phabricator {T5184#87479}

Conflicts: resources/analysisTools/snvPipeline/confidenceAnnotation_SNVs.py resources/analysisTools/snvPipeline/filter_PEoverlap.py resources/analysisTools/snvPipeline/snvAnnotation.sh resources/configurationFiles/analysisSNVCalling.xml

resources/analysisTools/snvPipeline/confidenceAnnotation_SNVs.py

resources/analysisTools/snvPipeline/filter_PEoverlap.py

resources/analysisTools/snvPipeline/in_dbSNPcounter.pl

resources/analysisTools/snvPipeline/intermutationDistance_Coord_color.r

resources/analysisTools/snvPipeline/snvAnnotation.sh

resources/analysisTools/snvPipeline/confidenceAnnotation_SNVs.py

resources/analysisTools/snvPipeline/in_dbSNPcounter.pl

resources/analysisTools/snvPipeline/snvAnnotation.sh

resources/analysisTools/snvPipeline/PurityReloaded.py

resources/analysisTools/snvPipeline/createErrorPlots.py

resources/analysisTools/snvPipeline/environments/tbi-lsf-cluster.sh

resources/analysisTools/snvPipeline/filter_vcf.sh

resources/analysisTools/snvPipeline/in_dbSNPcounter.pl

resources/analysisTools/snvPipeline/snvCalling.sh

resources/configurationFiles/analysisSNVCallingGRCh38.xml

Conflicts: README.md resources/analysisTools/snvPipeline/environments/tbi-lsf-cluster.sh resources/analysisTools/snvPipeline/filter_PEoverlap.py resources/configurationFiles/analysisSNVCalling.xml

vinjana · 2023-06-27T13:16:28Z

resources/analysisTools/snvPipeline/PurityReloaded.py

@@ -251,6 +251,10 @@ def parseVcf(file,num):
 	while (l!= ""):
 		t=l.split('\t')
 		if (t[0][0] != "#") and isValid(t):
+			# Skipping the non-primary assembly variants from purity calculations
+			if t[0].startswith('HLA') or t[0].endswith('_alt'):


Suggested renames: t -> fields, l -> line.

BTW (2 lines up): line[0] != "#" is much clearer than fields[0][0] != "#" (not to speak of t[0][0]).
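
As a sketch of how the loop could read with the suggested names, the snippet below iterates over VCF lines, checks the header marker on `line`, and skips HLA and ALT contigs as in the diff. It is not the code in `PurityReloaded.py`: `is_primary_contig` and `iter_primary_records` are hypothetical helpers, the `isValid` check is left out, and the example records are invented.

```python
import io


def is_primary_contig(chrom):
    """True unless the contig is an HLA or ALT contig (naming as in the diff)."""
    return not (chrom.startswith("HLA") or chrom.endswith("_alt"))


def iter_primary_records(vcf):
    """Yield the tab-split fields of data lines on primary-assembly contigs."""
    for line in vcf:
        if not line.strip() or line[0] == "#":
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        if is_primary_contig(fields[0]):
            yield fields


# Invented example input with one primary, one ALT and one HLA record.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "chr1\t100\t.\tA\tT\n"
    "chr1_KI270762v1_alt\t5\t.\tG\tC\n"
    "HLA-A*01:01:01:01\t7\t.\tC\tT\n"
)
primary_chroms = [fields[0] for fields in iter_primary_records(example)]
print(primary_chroms)
```

Iterating with `for line in vcf` also removes the need to call `readline()` again before each `continue`.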

vinjana · 2023-06-27T13:20:27Z

resources/analysisTools/snvPipeline/PurityReloaded.py

+			# Skipping the non-primary assembly variants from purity calculations
+			if t[0].startswith('HLA') or t[0].endswith('_alt'):
+				l=vcf.readline()
+				continue


What is i (next line)? Maybe mapped_chromosome?

vinjana · 2023-06-27T13:28:32Z

resources/analysisTools/snvPipeline/createErrorPlots.py

@@ -199,6 +199,10 @@ def calculateErrorMatrix(vcfFilename, referenceFilename, errorType):
 			# 23.05.2016 JB: Excluded multiallelic SNVs
 			if ',' in split_line[header.index("ALT")]: continue

+			# 21.02.2023 NP: Excluded SNVs with 'N' before or after "," in context


Could you add a short word about why you exclude these? I guess it is not obvious because it was added to the SNV workflow only after years of operation.
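
For readers unfamiliar with the check, a minimal sketch of what the new code comment describes could look like this. It assumes the sequence context is a string of the form `<upstream>,<downstream>`; `has_n_next_to_variant` is a hypothetical name, and the motivation for the exclusion is not stated in this thread.

```python
def has_n_next_to_variant(context):
    """True if the base directly before or after the comma is 'N'.

    Assumes a context string of the form '<upstream>,<downstream>'.
    """
    upstream, _, downstream = context.partition(",")
    return upstream[-1:].upper() == "N" or downstream[:1].upper() == "N"


# Invented context strings: only an N adjacent to the comma triggers the check.
print(has_n_next_to_variant("ACGTN,ACGT"))
print(has_n_next_to_variant("NCGT,ACGN"))
```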

vinjana · 2023-06-27T13:29:35Z

resources/analysisTools/snvPipeline/filter_PEoverlap.py

@@ -1,3 +1,483 @@
+<<<<<<< HEAD


Merge error.

vinjana · 2023-06-29T08:03:11Z

resources/analysisTools/snvPipeline/filter_PEoverlap.py

+    # Reference file for CRAM files
+    reference_file = args.refFileName


AFAICS reference_file is only used once, in the CRAM branch of the if-else block below. Moving it down will make this clearer.

Suggested change

-    # Reference file for CRAM files
-    reference_file = args.refFileName

vinjana · 2023-06-29T08:04:29Z

resources/analysisTools/snvPipeline/filter_PEoverlap.py

+        samfile = pysam.Samfile(args.alignmentFile, mode)
+    elif args.alignmentFile.split(".")[-1] == "cram":
+        mode += "c"
+        samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = reference_file)


Suggested change

-        samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = reference_file)
+        # CRAM needs a reference file.
+        samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = args.refFileName)
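
Putting the two suggestions together, the suffix handling could be sketched as below. This is an assumption about the overall shape, not the code in `filter_PEoverlap.py`: `alignment_open_args` is a hypothetical helper, and the read modes `"rb"` and `"rc"` are assumed. pysam is deliberately not imported so the sketch runs without it.

```python
import os


def alignment_open_args(alignment_file, reference_file=None):
    """Return (mode, kwargs) for opening a BAM or CRAM file for reading."""
    suffix = os.path.splitext(alignment_file)[1].lower()
    if suffix == ".bam":
        return "rb", {}
    if suffix == ".cram":
        # CRAM needs a reference file.
        return "rc", {"reference_filename": reference_file}
    raise ValueError("Unknown alignment suffix: %r" % suffix)


# Usage with pysam (shown as a comment only):
#   mode, kwargs = alignment_open_args(args.alignmentFile, args.refFileName)
#   samfile = pysam.AlignmentFile(args.alignmentFile, mode, **kwargs)
print(alignment_open_args("tumor.cram", "GRCh38.fa"))
```

With this layout the reference file name is only mentioned in the CRAM branch, and an unknown suffix fails early instead of leaving `samfile` undefined.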

vinjana · 2023-06-29T08:15:40Z

resources/configurationFiles/analysisSNVCallingGRCh38.xml

+    <cvalue name='CHROMOSOME_LENGTH_FILE' value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLength.tsv' type="path" />
+
+    <cvalue name="CHROMOSOME_INDICES" value="( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y HLA ALT )" type="bashArray" description="Chr indices for the calling"/>
+    <cvalue name="CHROMOSOME_HLA_CONTIGS_FILE" value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.HLA_contigs.bed' type="path" description="HLA contig list"/>


Could you add a description field with a short description of the file content and format? If I saw that correctly, it should be suited for samtools mpileup -R $file, right? That would already be valuable information. Same for the ALT file.

vinjana · 2023-06-29T08:32:51Z

resources/analysisTools/snvPipeline/vcf_pileup_compare_allin1_basecount.pl

 		$ccoord = $ctrl[1];
-		if ($tcoord == $ccoord)	# matching pair found!
+		if ($tcoord == $ccoord && $current_t_chr eq $current_c_chr)	# matching pair found!


If I see this correctly, the algorithm takes the two position-sorted VCFs and steps forward the feature -- either tumor or control VCF -- that is expected "earlier" in the algorithm, until it finds a match. Right?

I think, here and elsewhere, it would make more sense and would make the code clearer, if the chromosome comparison came before the coordinate comparison, i.e.

Suggested change

-		if ($tcoord == $ccoord && $current_t_chr eq $current_c_chr)	# matching pair found!
+		if ($current_t_chr eq $current_c_chr && $tcoord == $ccoord)	# matching pair found!
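
The stepping logic described in this comment can be sketched as a two-pointer walk over two sorted lists. The sketch is written in Python rather than Perl and is not the code in `vcf_pileup_compare_allin1_basecount.pl`; it assumes both inputs are sorted by contig in a shared order and then by position. `matching_positions`, `contig_order` and the example records are hypothetical.

```python
def matching_positions(tumor, control, contig_order):
    """Return positions present in both sorted (chrom, pos) lists.

    Both lists must be sorted by contig (in contig_order) and then position.
    """
    rank = {chrom: i for i, chrom in enumerate(contig_order)}
    matches = []
    t = c = 0
    while t < len(tumor) and c < len(control):
        t_chr, t_pos = tumor[t]
        c_chr, c_pos = control[c]
        # Compare the chromosome first, then the coordinate.
        t_key = (rank[t_chr], t_pos)
        c_key = (rank[c_chr], c_pos)
        if t_key == c_key:
            matches.append(tumor[t])  # matching pair found
            t += 1
            c += 1
        elif t_key < c_key:
            t += 1  # tumor record is earlier: step the tumor list forward
        else:
            c += 1  # control record is earlier: step the control list forward
    return matches


# Invented records; ("1", 100) and ("2", 100) share a coordinate but not a contig.
tumor = [("1", 100), ("1", 250), ("2", 100), ("X", 5)]
control = [("1", 250), ("2", 50), ("2", 100), ("X", 7)]
common = matching_positions(tumor, control, ["1", "2", "X"])
print(common)
```

Building the comparison key as (contig rank, position) puts the chromosome check first by construction, which is the ordering the suggestion asks for.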

vinjana · 2023-06-29T08:38:48Z

resources/analysisTools/snvPipeline/snvsPerChromPlot.r

@@ -48,6 +48,7 @@ dat$chromosome <- factor(dat$chromosome, levels = paste0("chr",c(seq(1,22),"X","


 chromLength = read.table(file = opt$chromLengthFile, header = F)
+chromLength$V1 = gsub("chr", "", chromLength$V1)


It's good practice to define a header during loading of the table. At least it would add a bit of information about the structure of the table.
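
In R this would mean passing `col.names` to `read.table`. The same idea, sketched in Python to match the other examples in this thread, names the columns while loading. The two-column layout (chromosome, length) is an assumption about the file, and `read_chrom_lengths` is a hypothetical helper.

```python
import csv
import io


def read_chrom_lengths(handle):
    """Read a headerless two-column TSV into {chromosome: length}."""
    reader = csv.DictReader(
        handle, fieldnames=["chromosome", "length"], delimiter="\t"
    )
    lengths = {}
    for row in reader:
        name = row["chromosome"]
        # Strip only a leading 'chr' prefix.
        if name.startswith("chr"):
            name = name[len("chr"):]
        lengths[name] = int(row["length"])
    return lengths


# Two-line example in the assumed layout.
example = io.StringIO("chr1\t248956422\nchrX\t156040895\n")
chrom_lengths = read_chrom_lengths(example)
print(chrom_lengths)
```

Note that the sketch removes only a leading `chr`, whereas `gsub("chr", "", ...)` in the diff removes every occurrence of the pattern.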

vinjana · 2023-06-29T09:11:51Z

resources/analysisTools/snvPipeline/in_dbSNPcounter.pl

+	}
+	else{
+		$all++;
+		@help = split ("\t", $line);


You could also rename this to @line. I think it also has advantages if two variables that differ only by type, but not (in principle) by content, are called the same. Same for @head above.

vinjana · 2023-06-29T09:13:18Z

resources/configurationFiles/analysisSNVCallingGRCh38.xml

+    <cvalue name="CHROM_SIZES_FILE" value="${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLenOnlyACGT_realChromosomes.tsv" type="path" />
+    <cvalue name='CHROMOSOME_LENGTH_FILE' value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLength.tsv' type="path" />
+
+    <cvalue name="CHROMOSOME_INDICES" value="( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y HLA ALT )" type="bashArray" description="Chr indices for the calling"/>


You should add CHROMOSOME_INDICES_PLOTTING with a description.

vinjana · 2023-06-29T09:15:13Z

README.md

+* 3.0.0
+
+  * Major
+    * Support for hg38/GRCh38 reference genome and variant calling from ALT and HLA contigs.


Are there any new mandatory configuration values? At least list them. Details should be in the XML.
Were the semantics of any configuration values changed?

vinjana · 2023-06-29T09:15:37Z

README.md

+  * Major
+    * Support for hg38/GRCh38 reference genome and variant calling from ALT and HLA contigs.
+  * Minor
+    * For hg38: Removing mappability and repeat elements' annotations from penalty calculations.


Are there any new optional configuration values? (at least list them)

vinjana · 2023-06-29T09:17:00Z

README.md

+    * skipREMAP: Option to remove repeat elements and mappability from confidence annotations in hg19.
+    * Removing EVS And ExAC AF from the annotations and no-control workflow filtering
+    * Support for variant calling from CRAM files
+    * Bug fix: Removing "quote" around the raw filter option `<RAW_SNV_FILTER_OPTIONS>`


Right. But this is a technical/code description. What was the effect of the bug on the user? (Probably that multiple RAW_SNV_FILTER_OPTIONS could not be used.)
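
If the guess above is right, the effect can be illustrated with shell-style word splitting. The sketch uses Python's `shlex` to mimic how a shell tokenizes a command line; `filter_tool` and the option string are hypothetical, and the actual defaults live in the analysis XML.

```python
import shlex

# Hypothetical option string with two options.
raw_filter_options = "--minVac=3 --minVaf=0.03"

# Unquoted expansion: the value is split into separate arguments.
unquoted = shlex.split("filter_tool " + raw_filter_options)

# Quoted expansion: the whole value arrives as a single argument.
quoted = shlex.split('filter_tool "' + raw_filter_options + '"')

print(unquoted)
print(quoted)
```

In the quoted case the tool receives one argument containing both options, so a second option would not be parsed on its own.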

vinjana · 2023-11-20T09:19:13Z

resources/analysisTools/snvPipeline/filter_PEoverlap.py

Could you try limiting the (heap) memory consumption of this Python process? See here.

vinjana · 2023-11-20T09:21:55Z

resources/analysisTools/snvPipeline/PurityReloaded.py

This is passing over the VCF once. Not sure how much memory it uses, but maybe it is worthwhile limiting the memory with this approach.
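
One way to cap memory from inside a Python process on Unix is the standard-library `resource` module. The sketch below is an assumption about what such a limit could look like and is not necessarily the approach referred to in these comments; `limit_address_space` and the 1 TiB figure are illustrative. `RLIMIT_AS` limits the virtual address space, which includes the heap.

```python
import resource


def limit_address_space(max_bytes):
    """Lower the soft address-space limit of this process (Unix only).

    Returns the previous (soft, hard) limits so they can be inspected.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    new_soft = max_bytes
    # Never raise an existing finite limit; only tighten it.
    for current in (soft, hard):
        if current != resource.RLIM_INFINITY:
            new_soft = min(new_soft, current)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    return soft, hard


# Illustrative cap of 1 TiB; a real run would pick a value near the job's request.
previous_limits = limit_address_space(1 << 40)
print(resource.getrlimit(resource.RLIMIT_AS)[0])
```

Once the cap is in place, an allocation that would exceed it typically fails with `MemoryError` in Python instead of the job growing until the cluster scheduler kills it.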

NagaComBio added 20 commits August 19, 2019 09:38

Removing ExAC and EVS from annotation and no-control filtering

ba8b0f5

Removing DUKE, DAC and HiSeqDepth based confidence annotations

8584941

Removing getRefGenomeAndChrPrefixFromHeader functions

7e49b5d

Updating annotation file paths to hg38 directory

ce97964

Removing the hg38 encode blacklist filter

c514297

Variant calling from CRAM files

3bebd1f

Checking for nonREFnonALT in BoolCounter class

b0ac261

updating pysam to 0.16.0.1

674900f

Moving files to ngs_share

128c1c0

Updating BoolCounter class

2de0727

REF name via BAM header

d8f3cf1

Reverting generic xml to hg19

b99be8f

New xml for GRCh38 files

98bb76e

chrLength file with 'chr' prefix

90b9978

Removing hard-coded header parsing

58b77cd

Liftover local control for hg38 WES and WGS

1af4b68

hg19 specific annotations

baabd82

User defined threshold-based 'SNP_support_germline' annotations

5a764ef

Removed extra spaces

da88940

Merge branch 'master' into hg38

ad3f650

Conflicts: resources/analysisTools/snvPipeline/confidenceAnnotation_SNVs.py resources/analysisTools/snvPipeline/filter_PEoverlap.py resources/analysisTools/snvPipeline/snvAnnotation.sh resources/configurationFiles/analysisSNVCalling.xml

NagaComBio requested review from GWarsow and vinjana and removed request for GWarsow March 16, 2021 10:34

vinjana reviewed Apr 12, 2021

NagaComBio added 5 commits April 26, 2021 17:03

Uncommenting reference detection

aef5111

updating refgenome help

22a55ec

Raise error: unknown alignment suffix

71900b3

Reformatting in_dbSNPcounter.pl

dbd835d

Reformatting IMD R file

df184d9

vinjana reviewed May 4, 2021

resources/analysisTools/snvPipeline/in_dbSNPcounter.pl

resources/analysisTools/snvPipeline/snvAnnotation.sh

NagaComBio added 20 commits May 4, 2022 09:51

Add variant calling in HLA/ALT contigs

d270358

Update ngs_share path

b683df2

Remove RE/MAP from hg38 penalties

1c0bc26

Add m2e2,HLA/ALT mappability files

6d47982

Merge branch 'mappability branch' into hg38

5fcc9e8

Fix the diagnostic plots for GRCh38

01567ac

Remove the quote for RAW_SNV_FILTER_OPTIONS

6321e0a

Upgrade to gencodev39 for hg38

0adf1e4

Merge branch 'master' into hg38

72a4135

Bug fix with dbSNP counter

c4fb48b

Update WGS local control

dc2ca36

hg38: Add local control and gnomAD based confidence annotation

973ae9c

Update README

56aa726

Exempt classification with FREQ

da45d2a

Reverting SNP based confidence scoring

45fa141

Add exception for 'N' in createErrorPlots.py

5479faf

Remove NA values in quantile calculation

f633856

Update reference

ea7154b

Update raw_filter_punishment in accordance with RAW_SNV_FILTER_OPTIONS

3f8419c

Move python env

13bdc5f

vinjana reviewed Apr 4, 2023

NagaComBio added 5 commits June 12, 2023 12:21

Update virtual env path

e2f7db9

XML comments to description

44614b6

Merge branch 'master' into hg38

dfff910

Conflicts: README.md resources/analysisTools/snvPipeline/environments/tbi-lsf-cluster.sh resources/analysisTools/snvPipeline/filter_PEoverlap.py resources/configurationFiles/analysisSNVCalling.xml

Minor update

3a47349

Update readme with hg38 calls

b8d4664

vinjana reviewed Jun 27, 2023

Fixing the missed conflict

4dca636

vinjana reviewed Jun 29, 2023

vinjana reviewed Nov 20, 2023
